- Hello, hi. So I want to get started. Welcome to CS 231N Lecture 11. Today we're going to talk about detection, segmentation, and a whole bunch of other really exciting topics around core computer vision tasks. But as usual, a couple of administrative notes.

Last time you took the midterm, so we didn't have lecture; hopefully that went okay for all of you. We're going to work on grading the midterm this week, but as a reminder, please don't have any public discussions about the midterm questions or answers until at least tomorrow, because there are still some people taking makeup midterms today and throughout the rest of the week. So we just ask that you refrain from talking publicly about midterm questions.

Why don't you wait until Monday? [laughing] Okay, great. So we're also starting to work on midterm grading; we'll get those back to you as soon as we can. We're also starting to work on grading assignment two, so there's a lot of grading being done this week. The TAs are pretty busy.

Also a reminder: hopefully you've been working hard on your projects now that most of you are done with the midterm. Your project milestones will be due on Tuesday. I know some people decided to switch projects after the proposal, and some teams reshuffled a little bit; that's fine, but your milestone should reflect the project that you're actually doing for the rest of the quarter. So hopefully that's going well.

I know there's been a lot of worry and stress on Piazza about assignment three. We're working on that as hard as we can, but it's actually a bit of a new assignment; it's changing a bit from last year, so it will be out as soon as possible, hopefully today or tomorrow. We do promise that whenever it comes out you'll have two weeks to finish it, so try not to stress about that too much. But I'm pretty excited; I think assignment three will be really cool and will cover a lot of really cool material.

Another thing: last time in lecture we mentioned this thing called the Train Game, which is a really cool thing we've been working on as a bit of a side project. It's an interactive tool that you can go on and use to explore the process of tuning hyperparameters in practice. This is totally not required for the course, totally optional, but if you do participate we will offer a small amount of extra credit for those of you who want to do well on it.
And we'll send out some more details later this afternoon on Piazza. But just a bit of a demo of what exactly this thing is. We've changed the name from Train Game to HyperQuest, because you're questing to find the best hyperparameters for your model. It's an interactive tool that you can use to explore the tuning of hyperparameters interactively in your browser.

You'll log in with your student ID and name, fill out a little survey about your experience with deep learning, and then read some instructions. In this game you'll be shown some random data set on every trial. This data set might be images or it might be vectors, and your goal is to train a model by picking the right hyperparameters interactively, to perform as well as you can on the validation set of this random data set. It'll keep track of your performance over time and there'll be a leaderboard; it'll be really cool.

Every time you play the game, you'll get some statistics about your data set. In this case we're doing a classification problem with 10 classes. You can see down at the bottom you have these statistics about the random data set: we have 10 classes, the input data size is 3 by 32 by 32, so this is some image data set, and in this case we have 8,500 examples in the training set and 1,500 examples in the validation set. These are all random; they'll change a little bit every time.

Based on these data set statistics, you'll make some choices about your initial learning rate, your initial network size, and your initial dropout rate. Then you'll see a screen like this, where it'll run one epoch with those chosen hyperparameters and show you two plots on the right. One is your training and validation loss for that first epoch; the other is your training and validation accuracy for that first epoch. Based on the gaps that you see in these two graphs, you can make choices interactively to change the learning rates and hyperparameters for the next epoch. You can either choose to continue training with the current or changed hyperparameters, you can stop training, or you can revert to go back to the previous checkpoint in case things got really messed up.
So then you'll get to make some choices. Here we'll decide to continue training, and in this case you could go and set new learning rates and new hyperparameters for the next epoch of training. You can also, kind of interesting here, actually grow the network interactively during training in this demo. There's a cool trick from a couple of recent papers where you can either take existing layers and make them wider, or add new layers to the network in the middle of training, while still maintaining the same function in the network. So you can do that to increase the size of your network in the middle of training here, which is kind of cool.

You'll make choices over several epochs, and eventually your final validation accuracy will be recorded, and we'll have a leaderboard that compares your score on that data set to some simple baseline models. Depending on how well you do on this leaderboard, we'll again offer some small amount of extra credit for those of you who choose to participate. This is, again, totally optional, but I think it can be a really cool learning experience to play around with and explore how hyperparameters affect the learning process. It's also really useful for us: you'll help science out by participating in this experiment. We're pretty interested in seeing how people behave when they train neural networks, so you'll be helping us out as well if you decide to play this. But again, totally optional, up to you. Any questions on that?

So the question was, will this be a paper or something eventually? Hopefully at some point, but it's really early stages of this project, so I can't make any promises. But I think it'll be really cool.

[laughing] Yeah, so the question is how can you add layers during training? I don't really want to get into that right now, but the paper to read is Net2Net; Ian Goodfellow is one of the authors. There's another paper from Microsoft called Network Morphism. So if you read those two papers you can see how this works.
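For the curious, here is a rough numpy sketch of the "widening" idea from Net2Net. This is an illustration only (the function name and details are made up for this sketch; see the paper for the real recipe): new hidden units are copies of existing ones, and the next layer's outgoing weights are rescaled by the copy count, so the widened network computes exactly the same function.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width):
    # Widen a hidden layer from n to new_width units while preserving the
    # function of the two-layer stack (assumes an elementwise nonlinearity).
    # W1: (d_in, n) incoming weights; b1: (n,) biases; W2: (n, d_out) outgoing.
    n = W1.shape[1]
    # Map each new unit onto an existing one; the first n map to themselves.
    mapping = np.concatenate([np.arange(n),
                              np.random.randint(0, n, new_width - n)])
    counts = np.bincount(mapping, minlength=n)   # how often each unit is copied
    W1_new = W1[:, mapping]                      # copy incoming weights and biases
    b1_new = b1[mapping]
    # Divide outgoing weights by the copy count so the output is unchanged.
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new
```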
Okay, so a bit of a reminder: last time, before the midterm, we talked about recurrent neural networks. We saw that recurrent neural networks can be used for different types of problems: in addition to one-to-one we can do one-to-many, many-to-one, and many-to-many. We saw how this can apply to language modeling, and we saw some cool examples of applying neural networks to model different sorts of language at the character level, where we sampled artificial math, Shakespeare, and C source code. We also saw how similar ideas can be applied to image captioning by connecting a CNN feature extractor together with an RNN language model, and we saw some really cool examples of that.

We also talked about the different types of RNNs. We talked about the Vanilla RNN; I also want to mention that this is sometimes called a Simple RNN or an Elman RNN, so you'll see all of these different terms in the literature. We also talked about the Long Short Term Memory, or LSTM. The LSTM has this crazy set of equations, but it makes sense because it helps improve gradient flow during back propagation and helps the model capture longer-term dependencies in our sequences.

So today we're going to switch gears and talk about a whole bunch of different exciting tasks. So far we've been talking mostly about the image classification problem. Today we're going to talk about various other computer vision tasks where you actually want to say things about the spatial pixels inside your images, so we'll see segmentation, localization, detection, and a couple of other computer vision tasks, and how you can approach them with convolutional neural networks.

As a bit of a refresher, so far the main thing we've been talking about in this class is image classification. Here we have some input image come in; that input image goes through some deep convolutional network; that network gives us some feature vector of maybe 4096 dimensions in the case of AlexNet; and then from that final feature vector we have some final fully-connected layer that gives us 1000 numbers for the different class scores that we care about, where 1000 is the number of classes in ImageNet in this example. At the end of the day, the network takes an input image and outputs a single category label saying what is the content of this entire image as a whole.
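To make that recap concrete, here is a minimal PyTorch-style sketch of the classification setup (the 4096 and 1000 sizes follow the AlexNet/ImageNet example above; everything else is placeholder):

```python
import torch
import torch.nn as nn

feats = torch.randn(1, 4096)   # feature vector from the conv backbone
fc = nn.Linear(4096, 1000)     # final fully connected layer
scores = fc(feats)             # one score per ImageNet class
label = scores.argmax(dim=1)   # single category label for the whole image
```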
But this is maybe the most basic possible task in computer vision, and there's a whole bunch of other interesting tasks that we might want to solve using deep learning. So today we're going to talk about several of these different tasks, step through each of them, and see how they all work with deep learning. We'll talk about what each problem is in more detail as we get to it, but as a summary: we'll talk first about semantic segmentation, then classification plus localization, then object detection, and finally a couple of brief words about instance segmentation.

So first is the problem of semantic segmentation. In semantic segmentation, we want to input an image and then output a category decision for every pixel in that image. So for this input image, for example, this cat walking through the field (he's very cute), in the output we want to say for every pixel: is that pixel cat, or grass, or sky, or trees, or background, or some other category? We have some set of categories, just like in the image classification case, but now rather than assigning a single category label to the entire image, we want to produce a category label for each pixel of the input image. This is called semantic segmentation.

One interesting thing about semantic segmentation is that it does not differentiate instances. In this example on the right we have an image with two cows standing right next to each other. In semantic segmentation we're just labeling all the pixels independently with the category of each pixel, so in a case like this, where we have two cows right next to each other, the output does not distinguish between the two cows. Instead we just get a whole mass of pixels that are all labeled as cow. This is a bit of a shortcoming of semantic segmentation, and we'll see how we can fix it later when we move to instance segmentation. But for now we'll just talk about semantic segmentation.

So one potential approach for attacking semantic segmentation might be through classification.
You could use a sliding window approach to semantic segmentation. You might imagine that we take our input image and break it up into many, many small local crops. In this example we've taken maybe three crops from around the head of this cow, and then you could imagine taking each of those crops and treating this as a classification problem: for this crop, what is the category of the central pixel? Then we could use all the same machinery that we've developed for classifying entire images, but now apply it on crops rather than on the entire image.

This would probably work to some extent, but it's not a very good idea. It would end up being super computationally expensive: because we want to label every pixel in the image, we would need a separate crop for every pixel, and running forward and backward passes through all of those would be super expensive. Moreover, if you think about it, we could actually share computation between different patches: if you're trying to classify two patches that are right next to each other and overlap, then the convolutional features of those patches go through the same convolutional layers, and a lot of that computation could be shared rather than applying the network to each patch separately. So this is actually a terrible idea, nobody does this, and you should probably not do this, but it's at least the first thing you might think of when approaching semantic segmentation.
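Just to make the naive idea concrete (purely illustrative; as just noted, you should not actually do this), here is a sketch where a hypothetical `classifier` stands in for any network that maps a crop to class scores. The nested loops make the cost obvious: one full forward pass per pixel.

```python
import torch
import torch.nn.functional as F

def sliding_window_segment(img, classifier, k=15):
    # img: (3, H, W); classifier: maps a (1, 3, k, k) crop to (1, C) scores.
    pad = k // 2
    padded = F.pad(img, (pad, pad, pad, pad))  # zero-pad so every pixel has a crop
    H, W = img.shape[1], img.shape[2]
    labels = torch.zeros(H, W, dtype=torch.long)
    for i in range(H):                         # one forward pass per pixel(!)
        for j in range(W):
            crop = padded[:, i:i + k, j:j + k].unsqueeze(0)
            labels[i, j] = classifier(crop).argmax(dim=1)
    return labels
```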
The next idea, which works a bit better, is a fully convolutional network. Rather than extracting individual patches from the image and classifying those patches independently, we can imagine our network being a whole giant stack of convolutional layers with no fully connected layers or anything. In this case we just have a bunch of convolutional layers that are all maybe three by three with zero padding, or something like that, so that each convolutional layer preserves the spatial size of the input. Now if we pass our image through a whole stack of these convolutional layers, the final convolutional layer can just output a tensor of size C by H by W, where C is the number of categories that we care about. You can see this tensor as giving our classification scores for every pixel of the input image, at every location, and we can compute this all at once with just a giant stack of convolutional layers. Then you could imagine training this thing by putting a classification loss at every pixel of the output, taking an average over those pixels in space, and training this kind of network through normal, regular back propagation.

Oh, the question is how do you develop training data for this? It's very expensive, right. The training data here requires labeling every pixel in the input images. There are tools online where you can go in and draw contours around the objects and then fill in regions, but in general getting this kind of training data is very expensive.

Yeah, the question is what is the loss function? Since we're making a classification decision per pixel, we put a cross-entropy loss on every pixel of the output. We have the ground truth category label for every pixel of the output; we compute a cross-entropy loss between every predicted pixel and the ground truth pixel, and then take either a sum or an average over space, and then a sum or an average over the mini-batch.
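Here is a minimal sketch of that setup in PyTorch (layer widths and the toy sizes are arbitrary). Every convolution is 3 by 3 with padding 1, so the spatial size is preserved, and the final layer emits per-pixel class scores that feed straight into a per-pixel cross-entropy loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 10, 3, padding=1),            # C = 10 categories
)

images = torch.randn(2, 3, 32, 32)              # toy batch
labels = torch.randint(0, 10, (2, 32, 32))      # ground truth label per pixel
scores = net(images)                            # (N, C, H, W) score tensor
loss = F.cross_entropy(scores, labels)          # averages over pixels and batch
loss.backward()                                 # normal back propagation
```

Question?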
Yeah, the question is do we assume that we know the categories? Yes, we do assume that we know the categories up front, just like in the image classification case. In image classification we know at the start of training, based on our data set, that there are maybe 10 or 20 or 100 or 1000 classes that we care about, and here we are similarly fixed to the set of classes for the data set.

So this model is relatively simple, and you can imagine it working reasonably well, assuming that you tuned all the hyperparameters right, but there's kind of a problem. In this setup, since we're applying a bunch of convolutions that all keep the same spatial size as the input image, this would be super expensive. If you wanted convolutions with maybe 64 or 128 or 256 channels, which is pretty common in a lot of these networks, then running those convolutions on a high resolution input image over a sequence of layers would be extremely computationally expensive and would take a ton of memory.

So in practice you don't usually see networks with this architecture. Instead you tend to see networks that look something like this, where we have some downsampling and then some upsampling of the feature map inside the network. Rather than doing all the convolutions at the full spatial resolution of the image, we'll maybe go through a small number of convolutional layers at the original resolution, then downsample that feature map using something like max pooling or strided convolutions, and keep alternating convolutions and downsampling, which looks much like a lot of the classification networks that you've seen. But now the difference is that rather than transitioning to a fully connected layer like you might in an image classification setup, we instead increase the spatial resolution of our predictions in the second half of the network, so that the output can be the same size as the input image. This ends up being much more computationally efficient, because you can make the network very deep while it works at a lower spatial resolution for many of the layers in the middle of the network.

So we've already seen examples of downsampling when it comes to convolutional networks.
We've seen that you can do strided convolutions or various types of pooling to reduce the spatial size of the feature map inside a network, but we haven't really talked about upsampling, and the question you might be wondering is: what do these upsampling layers actually look like inside the network? What are our strategies for increasing the size of a feature map? Sorry, was there a question in the back?

Yeah, so the question is how do we upsample? And the answer is that's the topic of the next couple of slides. [laughing]

One strategy for upsampling is something like unpooling. We have this notion of pooling to downsample; we talked about average pooling and max pooling. With average pooling we take a spatial average within the receptive field of each pooling region. One analog for upsampling is the idea of nearest neighbor unpooling. Here on the left we see an example of nearest neighbor unpooling, where our input is some two by two grid and our output is a four by four grid. In the output we've done a two by two, stride two nearest neighbor unpooling, or upsampling, where we've just duplicated each element over every point in the two by two receptive field of its unpooling region.

Another thing you might see is bed of nails unpooling, or bed of nails upsampling. Again we have a two by two receptive field for each unpooling region, but now we make the output all zeros except for one element of each region. In this case we've taken each of our inputs and always put it in the upper left-hand corner of its unpooling region, and everything else is zeros. This is kind of like a bed of nails because the zeros are very flat, and then you've got these things poking up at the various non-zero positions.

Another thing that you see sometimes, which was alluded to by the question a minute ago, is the idea of max unpooling. A lot of these networks tend to be symmetrical: we have a downsampling portion of the network and then an upsampling portion, with a symmetry between those two portions.
So sometimes you'll see this idea of max unpooling, where each upsampling layer is associated with one of the pooling layers in the first half of the network. In the downsampling half, when we do max pooling, we'll actually remember which element of the receptive field was selected as the max, and then in the second half of the network, we'll do something that looks like bed of nails upsampling, except rather than always putting the elements in the same position, we'll stick each one into the position that was used in the corresponding max pooling step earlier in the network. I'm not sure if that explanation was clear, but hopefully the picture makes sense.

Yeah, so then you just end up filling the rest with zeros: you fill the rest with zeros, and you stick the elements from the low resolution feature map up into the high resolution feature map at the points where the corresponding max pooling took place.

Okay, so that's kind of an interesting idea. Sorry, question?

Oh yeah, so the question is why is this a good idea, why might this matter? The idea is that when we're doing semantic segmentation we want our predictions to be pixel perfect. We want to get those sharp boundaries and tiny details in our predicted segmentation. With max pooling you're losing spatial information in some sense: after max pooling, you no longer know where in the local receptive field each feature value came from. So if you unpool by putting each value back in the same slot, you might think that helps handle those fine details a little bit better and helps preserve some of the spatial information that was lost during max pooling. Question?

The question is does this make things easier for back prop? I don't think it changes the back prop dynamics too much, because storing these indices is not a huge computational overhead; they're pretty small in comparison to everything else.
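As a concrete reference point, PyTorch's pooling layers expose exactly this pattern: the pooling layer can return its argmax indices, and the matching unpooling layer consumes them. A minimal sketch (sizes are arbitrary):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # remembers argmax positions
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
y, idx = pool(x)           # (1, 1, 2, 2) values plus their original positions
x_up = unpool(y, idx)      # (1, 1, 4, 4): max values restored, zeros elsewhere

# Nearest-neighbor upsampling, by comparison, just duplicates each value:
x_nn = nn.Upsample(scale_factor=2, mode='nearest')(y)
```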
Another thing that you'll see sometimes is the idea of transpose convolution. The various types of unpooling we just talked about, bed of nails, nearest neighbor, and max unpooling, are all fixed functions; they don't really learn how to do the upsampling. If you think about something like strided convolution, a strided convolution is a learnable layer that learns the way the network wants to perform downsampling at that layer. By analogy, there's a type of layer called transpose convolution that lets us do learnable upsampling: it both upsamples the feature map and learns weights for how it wants to do that upsampling.

This is really just another type of convolution. To see how it works, remember how a normal three by three, stride one, pad one convolution works: our input might be four by four and our output might be four by four, and we have this three by three kernel that we plop down at the corner of the image, take an inner product, and that inner product gives us the value of the activation in the upper left-hand corner of our output. We repeat this for every receptive field in the image.

Strided convolution ends up looking pretty similar, except our input might be four by four and our output two by two. We still have some three by three filter or kernel that we plop down on the image and take an inner product to compute a value of the output activation. But with strided convolution, rather than plopping down the filter at every possible point in the input, we move the filter by two pixels in the input every time we move by one pixel in the output. The stride of two gives us a ratio between how much we move in the input versus how much we move in the output. So a strided convolution with stride two downsamples the image, or the feature map, by a factor of two in a learnable way.

A transpose convolution is sort of the opposite: here our input will be two by two and our output will be four by four. But now the operation that we perform is a little bit different.
Rather than taking an inner product, what we do is take the value of our input feature map at that upper left-hand corner, which is some scalar value; multiply the filter by that scalar; and copy those values into the corresponding three by three region of the output. So rather than taking an inner product between our filter and the input, the input gives weights that we use to weight the filter, and the output is made of weighted copies of the filter, weighted by the values in the input.

We can use the same ratio trick to upsample: when we move one pixel in the input, we plop our filter down two pixels away in the output. The blue pixel in the input is some scalar value; we take that scalar, multiply it by the values in the filter, and copy those weighted filter values into the new region of the output. The tricky part is that these receptive fields in the output can now overlap, and where they overlap we just sum the results. You can imagine repeating this process everywhere, and it ends up doing a learnable upsampling, where we use learned convolutional filter weights to upsample the image and increase the spatial size.

By the way, you'll see this operation go by a lot of different names in the literature. Sometimes it gets called deconvolution, which I think is kind of a bad name, but you'll see it out there in papers. From a signal processing perspective, deconvolution means the inverse operation to convolution, which this is not; however, you'll frequently see this type of layer called a deconvolution layer in some deep learning papers, so watch out for that terminology. You'll also sometimes see this called upconvolution, which is kind of a cute name. Sometimes it gets called fractionally strided convolution, because if we think of the stride as the ratio in step size between the input and the output, then this is something like a stride one half convolution, because of the ratio of one to two between steps in the input and steps in the output.
This also sometimes gets called a backwards strided convolution, because if you work through the math, the forward pass of a transpose convolution ends up being the same mathematical operation as the backward pass of a normal convolution. You might have to take my word for it; that might not be super obvious when you first look at it, but it's kind of a neat fact, so you'll sometimes see that name as well.

As a more concrete example of what this looks like, I think it's a little easier to see in one dimension. Here we're doing a three by one transpose convolution in one dimension, so our filter is just three numbers and our input is two numbers. You can see that in the output we've taken the values of the input, used them to weight copies of the filter, and plopped down those weighted filters in the output with a stride of two; where the receptive fields overlap in the output, we sum.

So you might be wondering: transpose convolution is kind of a funny name. Where does it come from, and why is it actually my preferred name for this operation? It comes from a neat interpretation of convolution: it turns out that any convolution can always be written as a matrix multiplication. Again, this is easier to see with a one-dimensional example. Here we're doing a one-dimensional convolution between a weight vector x, which has three elements, and an input vector a, which has four elements: a, b, c, d. This is a three by one convolution with stride one, and we can frame the whole operation as a matrix multiplication, where we take our convolutional kernel x and turn it into a matrix capital X, which contains copies of the kernel offset to different positions. Now we take this weight matrix X and do a matrix-vector multiplication with our input a, and this produces the same result as the convolution.

Transpose convolution then means that we multiply by the transpose of that same weight matrix. Here you can see the same example for a stride one convolution on the left and the corresponding stride one transpose convolution on the right.
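A small numpy sketch of this matrix view (the filter and input values are arbitrary, and the border/padding handling is simplified to keep the matrices small):

```python
import numpy as np

x, y, z = 1., 2., 3.                 # the 3-tap filter (x, y, z)
a = np.array([4., 5., 6., 7.])       # the input (a, b, c, d)

# Stride-1 convolution (pad 1) as a matrix: rows are shifted filter copies.
X = np.array([[y, z, 0, 0],
              [x, y, z, 0],
              [0, x, y, z],
              [0, 0, x, y]])
out = X @ a                          # same result as the convolution

# Transpose convolution multiplies by X^T; for stride 1 this is again just
# a convolution (with the filter flipped and different border handling).
out_t = X.T @ out

# With stride 2 the rows shift by 2, so X becomes 2x4, and X^T upsamples
# 2 -> 4: something fundamentally different from a stride-2 convolution.
X2 = np.array([[y, z, 0, 0],
               [0, x, y, z]])
up = X2.T @ (X2 @ a)                 # overlapping filter copies get summed
```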
And if you work through the details, you'll see that for stride one, a stride one transpose convolution also ends up being a stride one normal convolution; there are some details in how the border and the padding are handled, but it's fundamentally the same operation. Things look different when you talk about a stride of two. Again, on the left we can write out a stride two convolution as a matrix multiplication. But now the corresponding transpose convolution is no longer a convolution: if you look at this weight matrix and think about how convolutions get represented this way, the transposed matrix for the stride two convolution is something fundamentally different from the original convolution operation. That's the reasoning behind the name, and that's why I think it's the nicest name to call this operation by. Sorry, was there a question?

Sorry? It's very possible there's a typo in the slide, so please point it out on Piazza and I'll fix it, but I hope the idea was clear. Is there another question? Okay, thank you. [laughing] Yeah, so, oh no, lots of questions.

Yeah, so the issue is why do we sum and not average? The reason we sum is that that's what falls out of the transpose convolution formulation. But you're right that this is kind of a problem: the magnitudes in the output will vary depending on how many receptive fields overlap at each output position. In practice this is something people started to point out fairly recently: using three by three, stride two transpose convolutions for upsampling can produce checkerboard artifacts in the output, exactly due to that problem. What I've seen in a couple of more recent papers is to use four by four stride two, or two by two stride two, transpose convolutions for upsampling instead, and that helps alleviate the problem a little bit.

- Yeah, so the question is what is a stride half convolution, and where does that terminology come from? I think that was from my paper, so that was definitely this. At the time I was writing that paper I was kind of into the name fractionally strided convolution, but after thinking about it a bit more, I think transpose convolution is probably the right name.

So with all of this, the idea of semantic segmentation actually ends up being pretty natural. You just have a giant convolutional network with downsampling and upsampling inside: the downsampling is by strided convolution or pooling, the upsampling is by transpose convolution or various types of unpooling, and we can train the whole thing end to end with back propagation using the cross-entropy loss over every pixel. So this is actually pretty cool: we can take a lot of the same machinery that we already learned for image classification and very easily extend it to new types of problems.
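Putting the pieces together, here is a hedged sketch of what such a downsample/upsample segmentation network might look like in PyTorch (layer widths are made up; note the four by four, stride two transpose convolutions, following the checkerboard point above):

```python
import torch
import torch.nn as nn

class SegNetSketch(nn.Module):
    def __init__(self, num_classes, d=64):
        super().__init__()
        self.down = nn.Sequential(                                 # downsampling half
            nn.Conv2d(3, d, 3, stride=2, padding=1), nn.ReLU(),    # H -> H/2
            nn.Conv2d(d, 2 * d, 3, stride=2, padding=1), nn.ReLU(),# H/2 -> H/4
        )
        self.up = nn.Sequential(                                   # learnable upsampling
            nn.ConvTranspose2d(2 * d, d, 4, stride=2, padding=1), nn.ReLU(),  # H/4 -> H/2
            nn.ConvTranspose2d(d, num_classes, 4, stride=2, padding=1),       # H/2 -> H
        )

    def forward(self, x):
        return self.up(self.down(x))   # (N, C, H, W) per-pixel class scores

scores = SegNetSketch(num_classes=10)(torch.randn(2, 3, 32, 32))
```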
So the next task I want to talk about is classification plus localization. We've talked about image classification a lot, where we want to assign a category label to the input image, but sometimes you might want to know a little bit more about the image. In addition to predicting what the category is, in this case cat, you might also want to know where that object is in the image: in addition to predicting the category label cat, you might also want to draw a bounding box around the region of the cat in the image.

The distinction between classification plus localization and object detection is that in the localization scenario, you assume ahead of time that there's exactly one object in the image that you're looking for (or maybe some fixed number, known ahead of time), so we're going to make one classification decision about the image and produce exactly one bounding box telling us where that object is located. We sometimes call this task classification plus localization.

Again, we can reuse a lot of the machinery we've already learned from image classification to tackle this problem. A basic architecture for this problem looks something like this: we have our input image, we feed it through some giant convolutional network, AlexNet for example, which gives us some final vector summarizing the content of the image. Just like before, we have a fully connected layer that goes from that final vector to our class scores. But now we also have another fully connected layer that goes from that vector to four numbers, where the four numbers are something like the height, the width, and the x and y positions of the bounding box. So the network produces two different outputs: one is the set of class scores, and the other is four numbers giving the coordinates of the bounding box in the input image.
Now at training time we'll have two losses. In this scenario we're assuming a fully supervised setting, so each of our training images is annotated with both a category label and a ground truth bounding box for that category. So we have two loss functions: our favorite softmax loss, computed using the ground truth category label and the predicted class scores, and some loss that measures the dissimilarity between our predicted bounding box coordinates and the ground truth bounding box coordinates. One very simple choice is an L2 loss between the two; that's the simplest thing you'll see in practice, although sometimes people use L1 or smooth L1, or parametrize the bounding box a little differently, but the idea is always the same: you have some regression loss between your predicted bounding box coordinates and the ground truth bounding box coordinates. Question? Sorry, go ahead.

So the question is, is it a good idea to do all of this at the same time? What happens if you misclassify; should you even look at the box coordinates? So sometimes people get fancy with it, but in general it works okay. It's not a big problem; you can train a network to do both of these things at the same time and it'll figure it out. But things can sometimes get tricky with misclassification, so what you'll see sometimes, for example, is that rather than predicting a single box, you might predict a separate box for each category and then only apply the loss to the predicted box corresponding to the ground truth category. People do get a little fancy with these things, and that sometimes helps a bit in practice. But this basic setup, even if it's not optimal, will work and will do something. Was there a question in the back?

Yeah, so the question is: these losses have different units; do they dominate the gradient? This is what we call a multi-task loss. Whenever we take derivatives, we want the derivative of a scalar with respect to our network parameters, and we use that derivative to take gradient steps. But now we have two scalars that we both want to minimize, so what you tend to do in practice is introduce an additional hyperparameter giving a weighting between these two losses, and take a weighted sum of the two loss functions to give the final scalar loss.
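A minimal sketch of that weighted multi-task loss (the 4096/1000 sizes echo the AlexNet/ImageNet example; `lam` is the weighting hyperparameter being discussed, and everything else is placeholder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.randn(8, 4096)              # vectors from the conv backbone
cls_head = nn.Linear(4096, 1000)          # fully connected -> class scores
box_head = nn.Linear(4096, 4)             # fully connected -> (x, y, w, h)

labels = torch.randint(0, 1000, (8,))     # ground truth categories
gt_boxes = torch.rand(8, 4)               # ground truth box coordinates

lam = 1.0                                 # loss-weighting hyperparameter
loss = F.cross_entropy(cls_head(feats), labels) \
     + lam * F.mse_loss(box_head(feats), gt_boxes)  # softmax loss + L2 loss
loss.backward()
```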
Then you take your gradients with respect to this weighted sum of the two losses. And this ends up being really tricky, because the weighting is a hyperparameter that you need to set, but it's kind of different from the other hyperparameters we've seen so far, because this weighting hyperparameter actually changes the value of the loss function. Often, when you're trying to set hyperparameters, you make different hyperparameter choices and see what happens to the loss under each choice; but in this case, because the hyperparameter affects the absolute value of the loss, making those comparisons becomes tricky. So setting that hyperparameter is somewhat difficult. In practice you kind of need to take it on a case-by-case basis for the exact problem you're solving, but my general strategy is to have some other metric of performance that you care about, other than the raw loss value, and then use that final performance metric to make your cross-validation choices, rather than looking at the value of the loss. Question?

So the question is why do we do this all at once? Why not do it separately?

Yeah, so the question is why don't we fix the big network and then just learn separate fully connected layers for these two tasks? People do do that sometimes, and in fact that's probably the first thing you should try if you're faced with a situation like this. But in general, whenever you're doing transfer learning, you get better performance if you fine-tune the whole system jointly, because there's probably some mismatch between the features: if you train on ImageNet and then use that network for your own data set, you'll get better performance on your data set if you can also update the network. One trick you might see in practice is to freeze the network, train the two heads separately until convergence, and then go back and jointly fine-tune the whole system. So that's a trick people sometimes use in that situation. And as I've alluded to, this big network is often a network pre-trained on ImageNet, for example.

As a bit of an aside, this idea of predicting some fixed number of positions in the image can be applied to a lot of different problems beyond just classification plus localization. One cool example is human pose estimation. Here we want to take as input an image of a person.
We want to output the positions of the joints for that person, which lets the network predict the pose of the human: where are their arms, where are their legs, and so on. Generally, most people have the same number of joints; that's a bit of a simplifying assumption, and it might not always be true, but it works for the network. For example, one parameterization that you might see in some data sets defines a person's pose by 14 joint positions: their feet and their knees and their hips and so on. Now when we train the network, we input this image of a person and output those 14 positions, giving the x and y coordinates for each of the 14 joints. Then you apply some kind of regression loss on each of the 14 predicted points and train the network with back propagation again. You might see an L2 loss here, but people play around with other regression losses as well. Question?

So the question is what do I mean when I say regression loss? I mean something other than cross entropy or softmax. When I say regression loss I usually mean an L2 Euclidean loss, or an L1 loss, or sometimes a smooth L1 loss. In general, classification versus regression is about whether your output is categorical or continuous. If you're expecting a categorical output, where you ultimately want to make a classification decision over some fixed number of categories, then you'll think about a cross entropy loss, a softmax loss, or the SVM margin-type losses we've talked about already in this class. But if your expected output is some continuous value, in this case the positions of these points, then you tend to use different types of losses: typically L2, L1, different kinds of things there. Sorry for not clarifying that earlier.
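A quick sketch of such a pose regression head (the 4096 feature size and the batch of 8 are placeholders; the 14 joints follow the parameterization above, and the L2 loss could be swapped for L1 or smooth L1 as just discussed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

head = nn.Linear(4096, 14 * 2)            # 14 joints, (x, y) each
feats = torch.randn(8, 4096)              # backbone features for 8 images
pred = head(feats).view(8, 14, 2)         # predicted joint coordinates
gt_joints = torch.rand(8, 14, 2)          # ground truth joint coordinates
loss = F.mse_loss(pred, gt_joints)        # L2 regression loss over all joints
```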
507 00:41:51,003 --> 00:41:54,344 For example, 508 00:41:54,344 --> 00:41:56,395 maybe you knew that you were always going to have pictures 509 00:41:56,395 --> 00:41:58,763 of a cat and a dog and you want to predict both 510 00:41:58,763 --> 00:42:01,392 the bounding box of the cat and the bounding box of the dog. 511 00:42:01,392 --> 00:42:03,062 In that case you'd know that you have a fixed number 512 00:42:03,062 --> 00:42:05,304 of outputs for each input, so you might imagine 513 00:42:05,304 --> 00:42:07,093 hooking up this type of regression-based 514 00:42:07,093 --> 00:42:09,264 classification plus localization framework 515 00:42:09,264 --> 00:42:10,743 for that problem as well. 516 00:42:10,743 --> 00:42:13,094 So this idea of some fixed number of regression outputs 517 00:42:13,094 --> 00:42:14,872 can be applied to a lot of different problems, 518 00:42:14,872 --> 00:42:17,039 including pose estimation. 519 00:42:19,062 --> 00:42:23,531 So the next task that I want to talk about is object detection, 520 00:42:23,531 --> 00:42:25,342 and this is a really meaty topic. 521 00:42:25,342 --> 00:42:27,422 This is kind of a core problem in computer vision 522 00:42:27,422 --> 00:42:29,910 and you could probably teach a whole seminar class 523 00:42:29,910 --> 00:42:31,868 on just the history of object detection 524 00:42:31,868 --> 00:42:33,902 and the various techniques applied there. 525 00:42:33,902 --> 00:42:35,931 So I'll be relatively brief and try to go over 526 00:42:35,931 --> 00:42:39,691 the main big ideas of object detection plus deep learning 527 00:42:39,691 --> 00:42:42,582 that have been used in the last couple of years. 528 00:42:42,582 --> 00:42:44,731 But the idea in object detection is that 529 00:42:44,731 --> 00:42:47,942 we again start with some fixed set of categories 530 00:42:47,942 --> 00:42:52,182 that we care about, maybe cats and dogs and fish or whatever, 531 00:42:52,182 --> 00:42:55,321 but some fixed set of categories that we're interested in. 532 00:42:55,321 --> 00:42:59,030 And now our task is that given our input image, 533 00:42:59,030 --> 00:43:02,470 every time one of those categories appears in the image, 534 00:43:02,470 --> 00:43:05,641 we want to draw a box around it and we want to predict 535 00:43:05,641 --> 00:43:08,710 the category of that box. So this is different 536 00:43:08,710 --> 00:43:10,902 from classification plus localization 537 00:43:10,902 --> 00:43:13,620 because there might be a varying number of outputs 538 00:43:13,620 --> 00:43:15,302 for every input image. 539 00:43:15,302 --> 00:43:17,910 You don't know ahead of time how many objects you expect 540 00:43:17,910 --> 00:43:20,081 to find in each image, so 541 00:43:20,081 --> 00:43:22,870 this ends up being a pretty challenging problem. 542 00:43:22,870 --> 00:43:25,630 So this is kind of interesting. 543 00:43:25,630 --> 00:43:28,988 We've seen this graph many times of the ImageNet 544 00:43:28,988 --> 00:43:31,870 classification performance as a function of years, 545 00:43:31,870 --> 00:43:34,761 and we saw that it just got better and better every year, 546 00:43:34,761 --> 00:43:37,342 and there's been a similar trend with object detection, 547 00:43:37,342 --> 00:43:39,131 because object detection has again been one 548 00:43:39,131 --> 00:43:41,291 of these core problems in computer vision 549 00:43:41,291 --> 00:43:44,110 that people have cared about for a very long time.
550 00:43:44,110 --> 00:43:46,390 So this slide is due to Ross Girshick, 551 00:43:46,390 --> 00:43:48,742 who's worked on this problem a lot, and it shows 552 00:43:48,742 --> 00:43:51,070 the progression of object detection performance 553 00:43:51,070 --> 00:43:54,441 on this one particular data set called PASCAL VOC, 554 00:43:54,441 --> 00:43:57,230 which has been used for a long time 555 00:43:57,230 --> 00:43:59,462 in the object detection community. 556 00:43:59,462 --> 00:44:02,428 And you can see that up until about 2012, 557 00:44:02,428 --> 00:44:04,761 performance on object detection started to stagnate 558 00:44:04,761 --> 00:44:08,161 and slow down a little bit, and then 2013 559 00:44:08,161 --> 00:44:10,039 was when some of the first deep learning approaches 560 00:44:10,039 --> 00:44:12,141 to object detection came around, and you can see 561 00:44:12,141 --> 00:44:13,982 that performance just shot up very quickly, 562 00:44:13,982 --> 00:44:16,171 getting better and better year over year. 563 00:44:16,171 --> 00:44:21,422 One thing you might notice is that this plot ends in 2015, and it's actually continued to go up since then, 564 00:44:21,422 --> 00:44:29,928 so the current state of the art on this data set is well over 80, and in fact a lot of recent papers don't even report results on this data set anymore because it's considered too easy. 565 00:44:29,929 --> 00:44:37,421 So it's a little bit hard to know, I'm not actually sure what the state of the art number on this data set is, but it's off the top of this plot. 566 00:44:37,422 --> 00:44:40,924 Sorry, did you have a question? Never mind. 567 00:44:42,051 --> 00:44:50,960 Okay, so as I already said, this is different from localization because there might be differing numbers of objects for each image. 568 00:44:50,961 --> 00:44:57,770 So for example for this cat on the upper left there's only one object, so we only need to predict four numbers, but now for this image in the middle 569 00:44:57,771 --> 00:45:05,551 there are three animals, so we need our network to predict 12 numbers, four coordinates for each bounding box. 570 00:45:05,552 --> 00:45:13,210 Or in this example of many many ducks, you want your network to predict a whole bunch of numbers. Again, four numbers for each duck. 571 00:45:13,211 --> 00:45:20,683 So object detection is quite different from localization, 572 00:45:20,683 --> 00:45:28,870 because in object detection you might have varying numbers of objects in the image and you don't know ahead of time how many you expect to find. 573 00:45:28,870 --> 00:45:34,568 So as a result, it's kind of tricky if you want to think of object detection as a regression problem. 574 00:45:34,568 --> 00:45:40,768 So instead, people tend to use kind of a different paradigm when thinking about object detection. 575 00:45:40,768 --> 00:45:49,958 So one approach that's very common and has been used for a long time in computer vision is this idea of sliding window approaches to object detection. 576 00:45:49,958 --> 00:45:59,360 So this is kind of similar to the idea of taking small patches for semantic segmentation, and we can apply a similar idea to object detection.
577 00:45:59,360 --> 00:46:05,118 So the idea is that we'll take different crops from the input image. In this case we've got this crop 578 00:46:05,118 --> 00:46:10,359 in the lower left hand corner of our image, and now we take that crop, feed it through our convolutional network, 579 00:46:10,359 --> 00:46:14,829 and our convolutional network makes a classification decision on that input crop. 580 00:46:14,829 --> 00:46:18,160 It'll say that there's no dog here, there's no cat here, 581 00:46:18,160 --> 00:46:23,899 and then in addition to the categories that we care about we'll add an additional category called background, 582 00:46:23,899 --> 00:46:32,288 and now our network can predict background in case it doesn't see any of the categories that we care about. So then when we take this crop 583 00:46:32,288 --> 00:46:39,008 from the lower left hand corner here, our network would hopefully predict background and say that no, there's no object here. 584 00:46:39,008 --> 00:46:44,128 Now if we take a different crop then our network would predict dog yes, cat no, background no. 585 00:46:44,128 --> 00:46:47,680 We take a different crop, we get dog yes, cat no, background no. 586 00:46:47,680 --> 00:46:54,372 Or a different crop: dog no, cat yes, background no. Does anyone see a problem here? 587 00:47:00,324 --> 00:47:04,764 Yeah, the question is how do you choose the crops? So this is a huge problem, right? 588 00:47:04,764 --> 00:47:10,543 Because there could be any number of objects in this image, these objects could appear at any location in the image, 589 00:47:10,543 --> 00:47:15,583 these objects could appear at any size in the image, these objects could also appear at any aspect ratio 590 00:47:15,583 --> 00:47:29,523 in the image, so if you want to do a brute force sliding window approach you'd end up having to test thousands, tens of thousands, many many different crops in order to tackle this problem. 591 00:47:29,523 --> 00:47:37,532 And in the case where every one of those crops is going to be fed through a giant convolutional network, this would be completely computationally intractable. 592 00:47:37,532 --> 00:47:45,920 So in practice people don't ever do this sort of brute force sliding window approach for object detection using convolutional networks. 593 00:47:47,044 --> 00:47:54,492 Instead there's this cool line of work called region proposals; these typically don't use deep learning. 594 00:47:54,492 --> 00:47:56,332 These are slightly more traditional computer vision 595 00:47:56,332 --> 00:48:05,401 techniques, but the idea is that a region proposal method uses more traditional signal processing, image processing type techniques to make some list 596 00:48:05,401 --> 00:48:14,341 of proposals: given an input image, a region proposal method will give you something like a thousand boxes where an object might be present. 597 00:48:14,341 --> 00:48:22,382 So you can imagine that maybe we look for edges in the image and try to draw boxes that contain closed edges, or something like that. 598 00:48:22,382 --> 00:48:30,132 There are various types of image processing approaches, but these region proposal methods will basically look for blobby regions in our input image and then give us 599 00:48:30,132 --> 00:48:38,962 some set of candidate proposal regions where objects might potentially be found.
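To see why the brute-force sliding window approach described a moment ago blows up, here is a tiny sketch that just counts crops for a few window sizes and a fixed stride; the image size, scales, and stride are made-up values for illustration.

```python
import itertools

def sliding_window_crops(img_w, img_h, scales=(64, 128, 256), stride=16):
    # Enumerate every crop position and size: this is the brute-force
    # approach described above, and the count explodes quickly.
    crops = []
    for size in scales:
        for x, y in itertools.product(range(0, img_w - size + 1, stride),
                                      range(0, img_h - size + 1, stride)):
            crops.append((x, y, size, size))
    return crops

# Even for a small 800x600 image with only 3 scales and 1 aspect ratio,
# this yields thousands of crops, each of which would need a full CNN
# forward pass under the brute-force scheme.
print(len(sliding_window_crops(800, 600)))
```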
And these are relatively fast-ish to run, 600 00:48:38,962 --> 00:48:44,703 so one common example of a region proposal method that you might see is something called Selective Search, 601 00:48:44,703 --> 00:48:49,284 which I think actually gives you 2000 region proposals, not the 1000 that it says on the slide. 602 00:48:49,284 --> 00:48:59,404 So you kind of run this thing, and then after about two seconds of churning on your CPU it'll spit out 2000 region proposals in the input image where objects are likely to be found. 603 00:48:59,404 --> 00:49:05,052 So there'll be a lot of noise in those. Most of them will not be true objects, but there's a pretty high recall: 604 00:49:05,052 --> 00:49:11,204 if there is an object in the image then it does tend to get covered by these region proposals from Selective Search. 605 00:49:11,204 --> 00:49:17,103 So now rather than applying our classification network to every possible location and scale in the image, 606 00:49:17,103 --> 00:49:25,164 instead what we can do is first apply one of these region proposal methods to get some set of proposal regions where objects are likely located, 607 00:49:25,164 --> 00:49:33,135 and then apply a convolutional network for classification to each of these proposal regions, and this will end up being much more computationally tractable 608 00:49:33,135 --> 00:49:36,903 than trying to do all possible locations and scales. 609 00:49:36,903 --> 00:49:45,583 And this idea all came together in this paper called R-CNN from a few years ago that does exactly that. 610 00:49:45,583 --> 00:49:53,263 So given our input image, in this case we'll run some region proposal method to get our proposals. These are also sometimes called 611 00:49:53,263 --> 00:49:56,724 regions of interest or ROIs, so again Selective Search 612 00:49:56,724 --> 00:49:59,692 gives you something like 2000 regions of interest. 613 00:49:59,692 --> 00:50:07,043 Now one of the problems here is that these regions in the input image could have different sizes, 614 00:50:07,043 --> 00:50:13,143 but if we're going to run them all through a convolutional network, our convolutional networks for classification 615 00:50:13,143 --> 00:50:18,149 all want inputs of the same size, typically due to the fully connected layers and whatnot, 616 00:50:18,149 --> 00:50:26,855 so we need to take each of these region proposals and warp them to the fixed square size that is expected as input to our downstream network. 617 00:50:26,855 --> 00:50:34,090 So we'll crop out the regions corresponding to the region proposals, we'll warp them to that fixed size, 618 00:50:34,090 --> 00:50:37,418 and then we'll run each of them through a convolutional network, 619 00:50:37,418 --> 00:50:48,479 which in this case uses an SVM to make a classification decision and predict categories for each of those crops. 620 00:50:48,479 --> 00:50:52,506 And then I lost a slide.
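Here is a minimal sketch of that per-proposal crop-warp-classify pipeline, assuming PyTorch; `cnn` is a stand-in classifier callable (the actual R-CNN pipeline extracted CNN features and trained per-class SVMs on top), and the proposal format is assumed to be integer pixel coordinates.

```python
import torch
import torch.nn.functional as F

def classify_proposals(image, proposals, cnn, input_size=224):
    # image: (3, H, W) tensor; proposals: list of (x1, y1, x2, y2) boxes,
    # e.g. the ~2000 regions from Selective Search.
    scores = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        # Warp each crop to the fixed square size the network expects.
        warped = F.interpolate(crop, size=(input_size, input_size),
                               mode='bilinear', align_corners=False)
        # One full forward pass per proposal; this is why R-CNN is slow.
        scores.append(cnn(warped))
    return torch.cat(scores)
```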
621 00:50:52,506 --> 00:51:05,650 It's not shown on the slide right now, but in addition R-CNN also predicts a regression, like a correction to the bounding box, for each of these input region proposals, 622 00:51:05,650 --> 00:51:13,549 because the problem is that your input region proposals are generally in the right position for an object but they might not be perfect. So in addition, 623 00:51:13,549 --> 00:51:24,658 on top of category labels for each of these proposals, R-CNN will also predict four numbers that are kind of an offset, or a correction, to the box that was predicted at the region proposal stage. 624 00:51:24,658 --> 00:51:27,919 So then again, this is a multi-task loss and you would train this whole thing jointly. 625 00:51:27,919 --> 00:51:30,169 Sorry, was there a question? 626 00:51:35,511 --> 00:51:39,359 The question is how much does the change in aspect ratio impact accuracy? 627 00:51:40,698 --> 00:51:41,772 It's a little bit hard to say. 628 00:51:41,772 --> 00:51:46,551 I think there are some controlled experiments in some of these papers, but I'm not sure 629 00:51:46,551 --> 00:51:48,738 I can give a generic answer to that. 630 00:51:48,738 --> 00:51:49,571 Question? 631 00:51:53,602 --> 00:51:56,772 The question is, is it necessary for regions of interest to be rectangles? 632 00:51:56,772 --> 00:52:03,731 So they typically are, because it's tough to warp non-rectangular regions, but once you move 633 00:52:03,731 --> 00:52:08,911 to something like instance segmentation then you sometimes get proposals that are not rectangles, 634 00:52:08,911 --> 00:52:12,071 if you actually do care about predicting things that are not rectangles. 635 00:52:12,071 --> 00:52:14,238 Is there another question? 636 00:52:18,704 --> 00:52:24,375 Yeah, so the question is, are the region proposals learned? So in R-CNN it's a traditional thing. 637 00:52:24,375 --> 00:52:29,203 These are not learned, this is some fixed algorithm that someone wrote down, but we'll see in a couple minutes 638 00:52:29,203 --> 00:52:33,466 that we've actually changed that a little bit in the last couple of years. 639 00:52:33,466 --> 00:52:35,633 Is there another question? 640 00:52:37,767 --> 00:52:40,735 The question is, is the offset always inside the region of interest? 641 00:52:40,735 --> 00:52:42,665 The answer is no, it doesn't have to be. 642 00:52:42,665 --> 00:52:50,786 You might imagine that suppose the region of interest put a box around a person but missed the head; then you could imagine the network inferring 643 00:52:50,786 --> 00:52:55,906 that oh, this is a person, but people usually have heads, so the box should extend a little bit higher. 644 00:52:55,906 --> 00:52:59,666 So sometimes the final predicted boxes will be outside the region of interest. 645 00:52:59,666 --> 00:53:00,499 Question? 646 00:53:08,110 --> 00:53:12,801 Yeah, the question is what happens when you have a lot of ROIs that don't correspond to true objects? 647 00:53:15,877 --> 00:53:22,550 And like we said, in addition to the classes that you actually care about, you add an additional background class, so your class scores can also 648 00:53:22,550 --> 00:53:26,289 predict background to say that there was no object here. 649 00:53:26,289 --> 00:53:27,122 Question?
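An aside before the next question: the box correction discussed above is commonly parameterized with the R-CNN-style delta encoding, sketched below; this is illustrative rather than the exact implementation from the paper.

```python
import math

def apply_box_deltas(box, deltas):
    # box: proposal (x, y, w, h); deltas: predicted (dx, dy, dw, dh)
    # in the common R-CNN parameterization: center shifts are relative
    # to the box size and the scale correction is in log space.
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    cx, cy = x + 0.5 * w, y + 0.5 * h          # proposal center
    cx, cy = cx + dx * w, cy + dy * h          # shifted center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescaled size
    # Note the corrected box can extend outside the original proposal,
    # as in the person-missing-a-head example above.
    return (cx - 0.5 * w, cy - 0.5 * h, w, h)
```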
650 00:53:37,716 --> 00:53:40,894 Yeah, so the question is what kind of data do we need, 651 00:53:40,894 --> 00:53:53,383 and yeah, this is fully supervised in the sense that our training data consists of images where each image has all the object categories marked, with bounding boxes for each instance of each category. 652 00:53:53,383 --> 00:54:02,945 There are definitely papers that try to approach this differently: what if you don't have the data, what if you only have that data for some images, or what if that data is noisy? But at least 653 00:54:02,945 --> 00:54:08,568 in the generic case you assume full supervision of all objects in the images at training time. 654 00:54:09,835 --> 00:54:16,535 Okay, so I think we've kind of alluded to this, but there are a lot of problems with this R-CNN framework. 655 00:54:16,535 --> 00:54:21,644 And actually if you look at the figure here on the right you can see that additional bounding box head, so I'll put it back. 656 00:54:21,644 --> 00:54:25,811 But this is still computationally pretty expensive, 657 00:54:27,436 --> 00:54:34,415 because if we've got 2000 region proposals and we're running each of those proposals through the network independently, that can be pretty expensive. 658 00:54:34,415 --> 00:54:42,895 There's also this question of relying on these fixed region proposals; we're not learning them, so that's kind of a problem. 659 00:54:42,895 --> 00:54:46,015 And just in practice it ends up being pretty slow, 660 00:54:46,015 --> 00:54:54,721 so the original implementation of R-CNN would actually dump all the features to disk, so it'd take hundreds of gigabytes of disk space to store all these features. 661 00:54:54,721 --> 00:54:58,472 Then training would be super slow, since you have to make all these different forward and backward passes 662 00:54:58,472 --> 00:55:06,134 through the image, and it took something like 84 hours, which is one number they've reported for training time, so this is super super slow. 663 00:55:06,134 --> 00:55:11,076 And at test time it's also super slow, something like 30 seconds to a minute per image, 664 00:55:11,076 --> 00:55:18,316 because you need to run thousands of forward passes through the convolutional network for each of these region proposals, so this ends up being pretty slow. 665 00:55:18,316 --> 00:55:27,404 Thankfully we have fast R-CNN, which fixed a lot of these problems. So when we do fast R-CNN then it's going to look kind of the same. 666 00:55:27,404 --> 00:55:34,116 We're going to start with our input image, but now rather than processing each region of interest separately, instead we're going to run the entire image 667 00:55:34,116 --> 00:55:41,924 through some convolutional layers all at once to give this high resolution convolutional feature map corresponding to the entire image. 668 00:55:41,924 --> 00:55:46,652 And we're still using some region proposals from some fixed thing like Selective Search, 669 00:55:46,652 --> 00:55:52,334 but rather than cropping out the pixels of the image corresponding to the region proposals, 670 00:55:52,334 --> 00:56:04,745 instead we imagine projecting those region proposals onto this convolutional feature map and then taking crops from the convolutional feature map corresponding to each proposal, rather than taking crops directly from the image. 671 00:56:04,745 --> 00:56:13,425 And this allows us to reuse a lot of this expensive convolutional computation across the entire image when we have many many crops per image.
672 00:56:13,425 --> 00:56:20,052 But again, if we have some fully connected layers downstream, those fully connected layers are expecting some fixed-size input, 673 00:56:20,052 --> 00:56:26,131 so now we need to do some reshaping of those crops from the convolutional feature map, 674 00:56:26,131 --> 00:56:31,673 and they do that in a differentiable way using something they call an ROI pooling layer. 675 00:56:31,673 --> 00:56:38,622 Once you have these warped crops from the convolutional feature map, then you can run these things through some 676 00:56:38,622 --> 00:56:45,673 fully connected layers and predict your classification scores and your linear regression offsets to the bounding boxes. 677 00:56:45,673 --> 00:56:51,654 And now when we train this thing we again have a multi-task loss that trades off between these two constraints, and during back propagation 678 00:56:51,654 --> 00:56:56,124 we can back prop through this entire thing and learn it all jointly. 679 00:56:56,124 --> 00:57:03,575 This ROI pooling looks kind of like max pooling; I don't really want to get into the details of that right now. 680 00:57:03,575 --> 00:57:12,014 And in terms of speed, if we look at R-CNN versus fast R-CNN versus this other model called SPPnet, which is kind of in between the two, 681 00:57:12,014 --> 00:57:16,924 then you can see that at training time fast R-CNN is something like 10 times faster to train, 682 00:57:16,924 --> 00:57:20,134 because we're sharing all this convolutional computation across the region proposals. 683 00:57:20,134 --> 00:57:23,272 And now at test time fast R-CNN is super fast, 684 00:57:23,272 --> 00:57:33,764 and in fact fast R-CNN is so fast at test time that its computation time is actually dominated by computing region proposals. 685 00:57:33,764 --> 00:57:39,334 So we said that computing these 2000 region proposals using Selective Search takes something like two seconds, 686 00:57:39,334 --> 00:57:53,273 and once we've got all these region proposals, because we're processing them in a shared way, sharing these expensive convolutions across the entire image, we can process all of these region proposals in less than a second altogether. 687 00:57:53,273 --> 00:57:59,142 So fast R-CNN ends up being bottlenecked just by the computation of these region proposals. 688 00:57:59,142 --> 00:58:03,804 Thankfully we've solved this problem with faster R-CNN. 689 00:58:03,804 --> 00:58:13,734 So the problem was that computing the region proposals using this fixed function was a bottleneck, 690 00:58:13,734 --> 00:58:18,054 so in faster R-CNN we'll just make the network itself predict its own region proposals. 691 00:58:18,054 --> 00:58:30,572 And the way that this works is that again, we take our input image, run the entire input image through some convolutional layers to get some convolutional feature map representing the entire high resolution image, 692 00:58:30,572 --> 00:58:33,204 and now there's a separate region proposal network 693 00:58:33,204 --> 00:58:39,204 which works on top of those convolutional features and predicts its own region proposals inside the network. 694 00:58:39,204 --> 00:58:44,542 Once we have those predicted region proposals then it looks just like fast R-CNN, 695 00:58:44,542 --> 00:58:50,662 where we take crops from those region proposals from the convolutional features and pass them up to the rest of the network.
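The ROI pooling step mentioned above can be sketched with torchvision's `roi_pool` operator; the feature map size, stride, and box coordinates below are made-up values for illustration.

```python
import torch
from torchvision.ops import roi_pool

# Shared feature map for the whole image: (batch=1, C=512, H=32, W=32),
# e.g. the output of a VGG-style backbone with total stride 16.
feature_map = torch.randn(1, 512, 32, 32)

# Proposals in image coordinates as (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 48., 64., 320., 256.],
                     [0, 128., 32., 480., 448.]])

# ROI pooling crops each proposal from the shared feature map and
# max-pools it to a fixed 7x7 spatial size, so differently sized
# regions all produce same-shaped inputs for the fc layers.
# spatial_scale maps image coordinates onto the downsampled map.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```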
696 00:58:50,662 --> 00:58:57,094 Now, we talked about multi-task losses and training networks to do multiple things at once. 697 00:58:57,094 --> 00:59:05,019 Well, here we're telling the network to do four things all at once, so balancing out this four-way multi-task loss is kind of tricky, 698 00:59:05,019 --> 00:59:14,848 because the region proposal network needs to do two things: it needs to say, for each potential proposal, whether it is an object or not an object, and it needs to regress 699 00:59:14,848 --> 00:59:18,186 the bounding box coordinates for each of those proposals, 700 00:59:18,186 --> 00:59:21,787 and then the final network at the end needs to do these two things again: 701 00:59:21,787 --> 00:59:26,288 make final classification decisions for what the class scores are for each of these proposals, 702 00:59:26,288 --> 00:59:34,086 and also do a second round of bounding box regression to again correct any errors that may have come from the region proposal stage. 703 00:59:34,086 --> 00:59:34,919 Question? 704 00:59:45,231 --> 00:59:50,703 So the question is that sometimes multi-task learning might be seen as regularization, and are we getting that effect here? 705 00:59:50,703 --> 00:59:52,602 I'm not sure if there have been super controlled studies 706 00:59:52,602 --> 01:00:01,162 on that, but actually in the original version of the faster R-CNN paper they did a little bit of experimentation, like what if we share 707 01:00:01,162 --> 01:00:03,951 the region proposal network, what if we don't share? 708 01:00:03,951 --> 01:00:08,522 What if we learn separate convolutional networks for the region proposal network versus the classification network? 709 01:00:08,522 --> 01:00:12,970 And I think there were minor differences, but it wasn't a dramatic difference either way. 710 01:00:12,970 --> 01:00:18,380 So in practice it's nicer to only learn one because it's computationally cheaper. 711 01:00:18,380 --> 01:00:19,713 Sorry, question? 712 01:00:33,583 --> 01:00:41,903 Yeah, the question is how do you train this region proposal network, because you don't have ground truth region proposals for it. 713 01:00:41,903 --> 01:00:45,172 So that's a little bit hairy. I don't want to get too much into those details, 714 01:00:45,172 --> 01:00:53,452 but the idea is that any time you have a potential proposal which has more than some threshold of overlap with any of the ground truth objects, 715 01:00:53,452 --> 01:00:57,771 then you call that a positive, and the network should predict it as a region proposal, 716 01:00:57,771 --> 01:01:04,471 and any potential proposal which has very low overlap with every ground truth object should be predicted as a negative. 717 01:01:04,471 --> 01:01:09,550 But there are a lot of dark magic hyperparameters in that process, and that's a little bit hairy. 718 01:01:09,550 --> 01:01:10,383 Question? 719 01:01:15,394 --> 01:01:19,793 Yeah, so the question is what is the classification loss on the region proposal network, and the answer is 720 01:01:19,793 --> 01:01:26,648 that it's making binary decisions. I didn't want to get into too much of the details of that architecture because it's a little bit hairy, but it's making binary decisions. 721 01:01:26,648 --> 01:01:32,269 So it has some set of potential regions that it's considering, and it's making a binary decision for each one: 722 01:01:32,269 --> 01:01:34,078 is this an object or not an object?
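To make that binary labeling concrete, here is a sketch of the overlap-threshold rule just described, assuming PyTorch and torchvision; the 0.7/0.3 thresholds follow the faster R-CNN paper, and the rest of the "dark magic" (sampling, batching, edge cases) is omitted.

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    # anchors: (A, 4), gt_boxes: (G, 4), both as (x1, y1, x2, y2).
    # The thresholds are hyperparameters; 0.7/0.3 are the values
    # used in the faster R-CNN paper.
    iou = box_iou(anchors, gt_boxes).max(dim=1).values  # best overlap per anchor
    labels = torch.full((anchors.shape[0],), -1)  # -1 = ignored in the loss
    labels[iou >= pos_thresh] = 1   # positive: train as "object"
    labels[iou < neg_thresh] = 0    # negative: train as "not object"
    return labels
```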
723 01:01:34,078 --> 01:01:37,578 So it's like a binary classification loss. 724 01:01:38,520 --> 01:01:43,658 So once you train this thing, faster R-CNN ends up being pretty darn fast. 725 01:01:43,658 --> 01:01:48,706 Now, because we've eliminated this overhead from computing region proposals outside the network, 726 01:01:48,706 --> 01:01:53,588 faster R-CNN ends up being very very fast compared to these other alternatives. 727 01:01:53,588 --> 01:01:59,388 Also, one interesting thing is that because we're learning the region proposals here, you might imagine: 728 01:01:59,388 --> 01:02:05,086 what if there was some mismatch between this fixed region proposal algorithm and my data? 729 01:02:05,086 --> 01:02:16,320 Once you're learning your own region proposals, you can overcome that mismatch if your data is somewhat weird or different from other data sets. 730 01:02:16,320 --> 01:02:22,914 So this whole family of R-CNN methods, where R stands for region, are all region-based methods, 731 01:02:22,914 --> 01:02:30,716 because there's some kind of region proposal and then we're doing some independent processing for each of those potential regions. 732 01:02:30,716 --> 01:02:36,708 So this whole family of methods is called the region-based methods for object detection. 733 01:02:36,708 --> 01:02:40,676 But there's another family of methods that you sometimes see for object detection 734 01:02:40,676 --> 01:02:43,818 which is all feed forward in a single pass. 735 01:02:43,818 --> 01:02:48,076 So one of these is YOLO, for You Only Look Once, 736 01:02:48,076 --> 01:02:50,796 and another is SSD, for Single Shot MultiBox Detector, 737 01:02:50,796 --> 01:02:54,067 and these two came out around the same time. 738 01:02:54,067 --> 01:03:02,348 But the idea is that rather than doing independent processing for each of these potential regions, instead we want to try to treat this like a regression problem and just make 739 01:03:02,348 --> 01:03:06,156 all these predictions all at once with some big convolutional network. 740 01:03:06,156 --> 01:03:13,468 So given our input image, you imagine dividing that input image into some coarse grid; in this case it's a seven by seven grid, 741 01:03:13,468 --> 01:03:18,556 and now within each of those grid cells you imagine some set of base bounding boxes. 742 01:03:18,556 --> 01:03:25,748 Here I've drawn three base bounding boxes, like a tall one, a wide one, and a square one, but in practice you would use more than three. 743 01:03:25,748 --> 01:03:32,858 So now for each of these grid cells and for each of these base bounding boxes you want to predict several things. 744 01:03:32,858 --> 01:03:41,868 One, you want to predict an offset off the base bounding box to predict what the true location of the object is, relative to the base bounding box. 745 01:03:43,020 --> 01:03:51,460 And you also want to predict classification scores, so maybe a classification score for each of these base bounding boxes: 746 01:03:51,460 --> 01:03:55,503 how likely is it that an object of this category appears in this bounding box? 747 01:03:55,503 --> 01:04:03,929 So at the end, from our input image we end up predicting this giant tensor of size seven by seven by (5B + C).
748 01:04:04,951 --> 01:04:12,700 That's just where we have B base bounding boxes, with five numbers for each, giving our offset and our confidence for that base bounding box, 749 01:04:12,700 --> 01:04:16,340 and C classification scores for our C categories. 750 01:04:16,340 --> 01:04:23,522 So then we see object detection as taking an input image and producing this three dimensional tensor as output, 751 01:04:23,522 --> 01:04:27,722 and you can imagine just training this whole thing with a giant convolutional network. 752 01:04:27,722 --> 01:04:30,682 And that's kind of what these single shot methods do, 753 01:04:30,682 --> 01:04:41,180 and again, matching the ground truth objects onto these potential base boxes becomes a little bit hairy, but that's what these methods do. 754 01:04:41,180 --> 01:04:48,539 And by the way, the region proposal network that gets used in faster R-CNN ends up looking quite similar to these, where they have some set 755 01:04:48,539 --> 01:04:55,279 of base bounding boxes over some gridded image, and the region proposal network does some regression plus some classification. 756 01:04:55,279 --> 01:04:59,196 So there are kind of some overlapping ideas here. 757 01:05:00,388 --> 01:05:13,892 So in faster R-CNN we're treating the region proposal step as kind of an end-to-end regression problem and then we do the separate per-region processing, but with these single shot methods 758 01:05:13,892 --> 01:05:19,761 we only do that first step and just do all of our object detection with a single forward pass. 759 01:05:19,761 --> 01:05:21,740 So object detection has a ton of different variables. 760 01:05:21,740 --> 01:05:23,950 There could be different base networks like VGG 761 01:05:23,950 --> 01:05:29,601 or ResNet; we've seen different meta-strategies for object detection, including this faster R-CNN 762 01:05:29,601 --> 01:05:31,820 type region-based family of methods 763 01:05:31,820 --> 01:05:34,060 and this single shot detection family of methods. 764 01:05:34,060 --> 01:05:38,153 There's kind of a hybrid that I didn't talk about called R-FCN which is somewhat in between. 765 01:05:38,153 --> 01:05:39,580 There are a lot of different hyperparameters, 766 01:05:39,580 --> 01:05:43,590 like what is the image size, how many region proposals do you use. 767 01:05:43,590 --> 01:05:48,022 And there's actually this really cool paper that will appear at CVPR this summer that does really 768 01:05:48,022 --> 01:05:56,353 controlled experimentation around a lot of these different variables and tries to tell you how these methods all perform under these different variables. 769 01:05:56,353 --> 01:05:58,676 So if you're interested I'd encourage you to check it out, 770 01:05:58,676 --> 01:06:06,702 but one of the key takeaways is that the faster R-CNN style of region-based methods tends to give higher accuracies but ends up being 771 01:06:06,702 --> 01:06:08,972 much slower than the single shot methods, 772 01:06:08,972 --> 01:06:12,486 because the single shot methods don't require this per-region processing. 773 01:06:12,486 --> 01:06:17,204 But I encourage you to check out this paper if you want more details.
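To make the single-shot output shape concrete, the seven by seven by (5B + C) tensor described above works out as follows; this is a sketch, with B = 3 and C = 20 as assumed values.

```python
# Shape of a YOLO-style output tensor for an S x S grid with B base
# boxes per cell and C categories: each cell predicts 5 numbers per
# box (x, y, w, h, confidence) plus C class scores.
S, B, C = 7, 3, 20  # grid size, boxes per cell, classes (e.g. PASCAL VOC)
output_shape = (S, S, 5 * B + C)
print(output_shape)  # (7, 7, 35): one big tensor from one forward pass
```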
774 01:06:17,204 --> 01:06:24,621 Also, as a bit of an aside, I had this fun paper with Andrej a couple years ago that combined object detection with image captioning 775 01:06:24,621 --> 01:06:27,273 and did this problem called dense captioning, 776 01:06:27,273 --> 01:06:32,472 so the idea is that rather than predicting a fixed category label for each region, 777 01:06:32,472 --> 01:06:35,084 instead we want to write a caption for each region. 778 01:06:35,084 --> 01:06:41,033 And again, we had a data set with this sort of data: regions together with captions, 779 01:06:41,033 --> 01:06:46,153 and then we trained this giant end-to-end model that predicted these captions all jointly. 780 01:06:46,153 --> 01:06:50,962 And this ends up looking somewhat like faster R-CNN, where you have some region proposal stage, 781 01:06:50,962 --> 01:06:53,764 then a bounding box, then some per-region processing. 782 01:06:53,764 --> 01:07:03,454 But rather than an SVM or a softmax loss, that per-region processing has a whole RNN language model that predicts a caption for each region. 783 01:07:03,454 --> 01:07:06,814 So that ends up looking quite a bit like faster R-CNN. 784 01:07:06,814 --> 01:07:11,524 There's a video here but I think we're running out of time so I'll skip it. 785 01:07:11,524 --> 01:07:17,897 But the idea here is that you can tie together a lot of these ideas, 786 01:07:17,897 --> 01:07:21,508 and if you have some new problem that you're interested in tackling, like dense captioning, 787 01:07:21,508 --> 01:07:26,860 you can recycle a lot of the components that you've learned from other problems like object detection and image captioning 788 01:07:26,860 --> 01:07:32,565 and stitch together one end-to-end network that produces the outputs that you care about for your problem. 789 01:07:32,565 --> 01:07:36,567 So the last task that I want to talk about is this idea of instance segmentation. 790 01:07:36,567 --> 01:07:40,636 Instance segmentation is in some ways like the full problem. 791 01:07:40,636 --> 01:07:50,594 We're given an input image and we want to predict the locations and identities of objects in that image, similar to object detection, but rather than just 792 01:07:50,594 --> 01:07:55,385 predicting a bounding box for each of those objects, instead we want to predict a whole segmentation mask 793 01:07:55,385 --> 01:08:02,785 for each of those objects and predict which pixels in the input image correspond to each object instance. 794 01:08:02,785 --> 01:08:07,484 So this is kind of like a hybrid between semantic segmentation and object detection, 795 01:08:07,484 --> 01:08:15,196 because like object detection we can handle multiple objects and we differentiate the identities of different instances. So in this example, 796 01:08:15,196 --> 01:08:21,924 since there are two dogs in the image, an instance segmentation method actually distinguishes between the two dog instances in the output. 797 01:08:21,924 --> 01:08:32,765 And kind of like semantic segmentation, we have this pixel-wise accuracy where for each of these objects we want to say which pixels belong to that object.
798 01:08:32,765 --> 01:08:38,247 So there are a lot of different methods that people have used to tackle instance segmentation as well, 799 01:08:38,247 --> 01:08:49,868 but the current state of the art is this new paper called Mask R-CNN that actually just came out on arXiv about a month ago, so this is not yet published, this is super fresh stuff. 800 01:08:49,868 --> 01:08:52,675 And this ends up looking a lot like faster R-CNN. 801 01:08:52,676 --> 01:08:55,296 So it has this multi-stage processing approach 802 01:08:55,296 --> 01:09:05,622 where we take our whole input image, that whole input image goes into some convolutional network and some learned region proposal network that's exactly the same as in faster R-CNN, 803 01:09:05,622 --> 01:09:14,795 and once we have our learned region proposals then we project those proposals onto our convolutional feature map, just like we did in fast and faster R-CNN. 804 01:09:14,796 --> 01:09:21,228 But now, rather than just making a classification and a bounding box regression decision for each of those boxes, we in addition 805 01:09:21,229 --> 01:09:27,478 want to predict a segmentation mask for each of those region proposals. 806 01:09:27,478 --> 01:09:36,888 So it kind of looks like a mini semantic segmentation problem inside each of the region proposals that we're getting from our region proposal network. 807 01:09:36,889 --> 01:09:45,947 So after we do this RoI Align step to warp our features corresponding to the region proposal into the right shape, we have two different branches. 808 01:09:45,948 --> 01:09:53,750 This first branch at the top looks just like faster R-CNN, and it will predict classification scores 809 01:09:53,750 --> 01:09:59,318 telling us what the category corresponding to that region proposal is, or alternatively whether or not it's background. 810 01:09:59,318 --> 01:10:04,596 And we'll also predict some bounding box coordinates that regress off the region proposal coordinates. 811 01:10:04,596 --> 01:10:13,550 And in addition we'll have this branch at the bottom, which looks basically like a mini semantic segmentation network, which will classify, for each pixel 812 01:10:13,550 --> 01:10:17,780 in that input region proposal, whether or not it's an object. 813 01:10:17,780 --> 01:10:29,230 So this mask R-CNN architecture just kind of unifies all of these different problems that we've been talking about today into one nice jointly end-to-end trainable model. 814 01:10:29,230 --> 01:10:36,710 And it's really cool and it actually works really really well; when you look at the examples in the paper they're kind of amazing. 815 01:10:36,710 --> 01:10:39,078 They look almost indistinguishable from ground truth. 816 01:10:39,078 --> 01:10:49,497 So in this example on the left you can see that there are these two people standing in front of motorcycles; it's drawn the boxes around these people, and it's also gone in and labeled all the pixels of those people. And it's really small, 817 01:10:49,497 --> 01:10:54,961 but in the background of that image on the left there's also a whole crowd of people standing very small in the background. 818 01:10:54,961 --> 01:10:58,628 It's also drawn boxes around each of those and grabbed the pixels of each of those people.
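Going back to the architecture for a moment, the two branches described above might be sketched like this in PyTorch; the channel counts and layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MaskRCNNHeads(nn.Module):
    # Hypothetical head layout after RoI Align: the pooled (C, 14, 14)
    # features feed a classification/box branch and a mask branch.
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 14 * 14, 1024), nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes + 1)   # +1 for background
        self.box_deltas = nn.Linear(1024, 4 * num_classes)  # per-class box regression
        # Mask branch: a small conv net predicting a per-pixel
        # object/background score inside each region proposal.
        self.mask = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_features):  # (num_rois, C, 14, 14)
        h = self.fc(roi_features)
        return self.cls_score(h), self.box_deltas(h), self.mask(roi_features)
```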
819 01:10:58,628 --> 01:11:08,028 And you can see that this ends up working really really well, and it's a relatively simple addition on top of the existing faster R-CNN framework. 820 01:11:08,028 --> 01:11:15,108 So I told you that mask R-CNN unifies everything we talked about today, and it also does pose estimation, by the way. 821 01:11:15,108 --> 01:11:22,257 So we talked about how you can do pose estimation by predicting the coordinates for each of the joints of the person, 822 01:11:22,257 --> 01:11:29,388 so you can use mask R-CNN to do joint object detection, pose estimation, and instance segmentation. 823 01:11:29,388 --> 01:11:35,246 And the only addition we need to make is that for each of these region proposals we add an additional little branch 824 01:11:35,246 --> 01:11:42,628 that predicts the coordinates of the joints for the instance in the current region proposal. 825 01:11:42,628 --> 01:11:51,715 So this is just another loss, another layer that we add, another head coming out of the network, and an additional term in our multi-task loss. 826 01:11:51,715 --> 01:11:59,406 But once we add this one little branch then you can do all of these different problems jointly, and you get results looking something like this, 827 01:11:59,406 --> 01:12:02,705 where now this network, a single feed forward network, 828 01:12:02,705 --> 01:12:09,792 is deciding how many people are in the image, detecting where those people are, figuring out the pixels corresponding to each 829 01:12:09,792 --> 01:12:22,742 of those people, and also drawing a skeleton estimating the pose of those people. And this works really well even in crowded scenes like this classroom, where there's a ton of people sitting and they all overlap each other, and it just seems to work incredibly well. 830 01:12:22,742 --> 01:12:28,291 And because it's built on the faster R-CNN framework it also runs relatively close to real time, 831 01:12:28,291 --> 01:12:36,061 so this is running at something like five frames per second on a GPU, because this is all done in a single forward pass of the network. 832 01:12:36,061 --> 01:12:42,833 So this is again a super new paper, but I think that this will probably get a lot of attention in the coming months. 833 01:12:42,833 --> 01:12:45,430 So just to recap, we've talked. 834 01:12:45,430 --> 01:12:46,680 Sorry, question? 835 01:12:53,800 --> 01:12:55,781 The question is how much training data do you need? 836 01:12:55,781 --> 01:13:00,948 So all of these instance segmentation results were trained on the Microsoft COCO data set, 837 01:13:00,948 --> 01:13:08,320 and Microsoft COCO is roughly 200,000 training images. It has 80 categories that it cares about, 838 01:13:08,320 --> 01:13:14,010 so in each of those 200,000 training images it has all the instances of those 80 categories labeled. 839 01:13:14,010 --> 01:13:23,285 So there are something like 200,000 images for training, with an average of, I think, five or six instances per image. So it actually is quite a lot of data. 840 01:13:23,285 --> 01:13:32,000 And for all the people in Microsoft COCO they also have all the joints annotated as well, so this actually does have quite a lot 841 01:13:32,000 --> 01:13:36,669 of supervision at training time, you're right, and it actually is trained with quite a lot of data.
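A sketch of that extra keypoint branch, under assumed sizes; COCO annotates 17 person keypoints, while the lecture's earlier pose example used 14 joints.

```python
import torch.nn as nn

# Hypothetical extra branch: one more head per region proposal that
# predicts K joint locations, added alongside the class/box/mask heads.
NUM_KEYPOINTS = 17  # COCO annotates 17 person keypoints
keypoint_head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, NUM_KEYPOINTS, 2, stride=2))
# Its output contributes simply one more term to the multi-task loss:
# total = cls_loss + box_loss + mask_loss + keypoint_loss
```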
842 01:13:36,669 --> 01:13:42,050 So I think one really interesting topic to study moving forward is that we kind of know 843 01:13:42,050 --> 01:13:50,701 that if you have a lot of data for some problem, at this point we're relatively confident that you can stitch up some convolutional network that can probably do a reasonable job at that problem, 844 01:13:50,701 --> 01:13:59,069 but figuring out ways to get performance like this with less training data is a super interesting and active area of research, and I think that's something people will be spending 845 01:13:59,069 --> 01:14:03,301 a lot of their effort on in the next few years. 846 01:14:03,301 --> 01:14:08,068 So just to recap, today we had kind of a whirlwind tour of a whole bunch of different computer vision topics, 847 01:14:08,068 --> 01:14:15,925 and we saw how a lot of the machinery that we built up for image classification can be applied relatively easily to tackle these different computer vision tasks. 848 01:14:15,925 --> 01:14:22,835 And next time we'll have a really fun lecture on visualizing CNN features.